Section outline: System Overview; Front-end Parameterization; Acoustic Modelling; Language Modelling; Current State of LVR; Current Issues; Real-time Operation. [Table 1: HTK LVR performance on ARPA CSR; columns: vocabulary size, LM n-gram, adapted, % word error]
Abstract
Considerable progress has been made in speech-recognition technology over the last few years, and nowhere has this progress been more evident than in the area of large-vocabulary recognition (LVR). Current laboratory systems are capable of transcribing continuous speech from any speaker with average word-error rates between 5% and 10%. If speaker adaptation is allowed, then after 2 or 3 minutes of speech, the error rate will drop well below 5% for most speakers. LVR systems had been limited to dictation applications since the systems were speaker dependent and required words to be spoken with a short pause between them. However, the capability to recognize natural continuous-speech input from any speaker opens up many more applications. As a result, LVR technology appears to be on the brink of widespread deployment across a range of information technology (IT) systems. This article discusses the principles and architecture of current LVR systems and identifies the key issues affecting their future deployment. To illustrate the various points raised, the Cambridge University HTK system is described. This system is a modern design that gives state-of-the-art performance, and it is typical of the current generation of recognition systems.
Current LVR systems are firmly based on the principles of statistical pattern recognition. The basic methods of applying these principles to the problem of speech recognition were pioneered by Baker, Jelinek, and their colleagues from IBM in the 1970s, and little has changed since [13, 54]. Figure 1 illustrates the main components of an LVR system. An unknown speech waveform is converted by a front-end signal processor into a sequence of acoustic vectors, $Y = y_1, y_2, \ldots, y_T$. Each of these vectors is a compact representation of the short-time speech spectrum covering a period of typically 10 ms. Thus, a typical 10-word utterance might have a duration of around 3 seconds and would be represented by a sequence of T = 300 acoustic vectors.
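The frame arithmetic above (about 3 seconds of speech at one vector per 10 ms gives T = 300) can be checked with a minimal Python sketch. The function names, the 16 kHz sample rate, and the use of non-overlapping frames are assumptions made for illustration; they are not taken from the article, and practical front ends usually analyse overlapping windows.

```python
def frame_count(duration_ms: int, frame_period_ms: int = 10) -> int:
    """Number of acoustic vectors if one vector is emitted per frame period."""
    if frame_period_ms <= 0:
        raise ValueError("frame_period_ms must be positive")
    return duration_ms // frame_period_ms


def split_into_frames(samples, sample_rate_hz: int = 16000, frame_period_ms: int = 10):
    """Cut a waveform into consecutive, non-overlapping frames.

    Assumption: 16 kHz sampling and no window overlap, chosen only to keep
    the sketch short. Each frame would then be reduced to one acoustic vector.
    """
    hop = sample_rate_hz * frame_period_ms // 1000  # samples per frame
    return [samples[i:i + hop] for i in range(0, len(samples) - hop + 1, hop)]


# A 3-second utterance, represented here by silence.
utterance = [0.0] * (16000 * 3)
frames = split_into_frames(utterance)
T = frame_count(3000)
```

Under these assumptions, both `T` and `len(frames)` come out at 300, matching the figure quoted in the text.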
The utterance consists of a sequence of words, $W = w_1, w_2, \ldots, w_n$, and it is the job of the LVR system to determine the most probable word sequence, $\hat{W}$, given the observed acoustic signal $Y$. To do this, Bayes' rule is used to decompose the required probability $P(W|Y)$ into two components, that is,
$$\hat{W} = \arg\max_{W} P(W|Y) = \arg\max_{W} \frac{P(W)\,P(Y|W)}{P(Y)}$$
This equation indicates that to find the most likely word sequence $\hat{W}$, the word sequence that maximizes the product of $P(W)$ and $P(Y|W)$ must be found. The first of …
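The decision rule described above, choosing the word sequence that maximizes the product of P(W) and P(Y|W), can be sketched in Python over a small explicit candidate list. The candidate sequences and their probabilities are invented for illustration, and the function name is hypothetical. A real LVR decoder searches a vastly larger hypothesis space rather than enumerating candidates.

```python
import math


def most_probable_sequence(candidates):
    """Return the word sequence W maximizing log P(W) + log P(Y|W).

    `candidates` maps a word-sequence tuple to a pair
    (log_prior, log_likelihood). P(Y) is the same for every candidate,
    so it can be dropped from the maximization.
    """
    return max(candidates, key=lambda w: candidates[w][0] + candidates[w][1])


# Hypothetical scores: the language model strongly prefers the first
# sequence, while the acoustic model slightly prefers the second.
candidates = {
    ("recognize", "speech"): (math.log(0.6), math.log(0.02)),
    ("wreck", "a", "nice", "beach"): (math.log(0.001), math.log(0.03)),
}
best = most_probable_sequence(candidates)
```

Working in log probabilities turns the product into a sum and avoids numerical underflow when many small probabilities are combined.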
Similar resources
SRI November 1993 CSR Spoke Evaluation
In this paper we present SRI’s results on the 1993 ARPA CSR Spoke Evaluations. This evaluation used the same HMM acoustic models as those used in SRI’s hub system: gender-dependent Genonic HMM’s. The system was made robust by modifying the front end algorithms to estimate the cepstral features (the HMM models were not modified). The robust front-end used a wide bandwidth (100-6400Hz) and estima...
Modelling and decoding of crossword context dependent phones in the Philips large vocabulary continuous speech recognition system
The performance of the Philips system for large vocabulary continuous speech recognition has been improved significantly by crossword N-phone modelling, enhanced clustering of HMM-states during training, consistent handling of untrained HMM-states during decoding and a new efficient crossword N-phone M-gram decoding strategy. We report word error rate reductions of up to 18% on various ARPA test s...
An alternative front-end for the AT&T WATSON LV-CSR system
In previously published work, we have proposed a novel feature extraction algorithm, based on the Teager-Kaiser energy estimates, that approximates human auditory characteristics and that is more robust to sub-band noise than the mean-square estimates of standard MFCCs. We refer to the novel features as Teager energy cepstrum coefficients (TECC). Herein, we study the TECC performance under addit...
Mitochondrial ATPase and high-energy phosphates in failing hearts.
This study examined high-energy phosphates (HEP) and mitochondrial ATPase protein expression in hearts in which myocardial infarction resulted in either compensated left ventricular remodeling (LVR) or congestive heart failure (CHF). The response of HEP (measured via (31)P magnetic resonance spectroscopy) to a modest increase in the cardiac work state produced by dobutamine-dopamine infusion an...
Acoustic front-end optimization for large vocabulary speech recognition
In this paper we describe experiments with the acoustic front-end of our large vocabulary speech recognition system. In particular, two aspects are studied: 1) linear transforms for feature extraction and 2) the modelling of the emission probabilities. Experiments are reported on a 5000-word task of the ARPA Wall Street Journal database. For the linear transforms our main results are: Filter-ba...